Abstract

In this project, we are using a dataset to explore following questis:
Is there a significant difference in income between men and women?
Does the difference vary depending on other factors (e.g., education, marital status, criminal history, drug use, childhood household factors, profession, etc.)?
Now, let’s begin!

Loading some necessary packages and basic settings here.

Data Summary

(a) Data Loading

We are using the NLSY79 (National Longitudinal Survey of Youth, 1979 cohort) data set. The NLSY79 data set contains survey responses on thousands of individuals who have been surveyed every one or two years starting in 1979.

Loading the whole dataset.

## Parsed with column specification:
## cols(
##   .default = col_double()
## )
## See spec(...) for full column specifications.

(b) Data Selecting

For the convenience of analysis, we would like to generate a sub-dataframe.

We will do this step by step:

Step 1: What are the variables we interested in?

This dataset contains lots of historical data, to be specific, we want to study on a cross sectional dataset of year 2012, so we are including some variables from 2012 survey mainly, and basic information of subjects such as race, gender in historical survey (which remain stable for each subjects). We will name our subdataset as gwgap(stands for “gender wage gap”).

Below are the selected variables and brief description:

income: total income from wage and salary in 2012
gender: gender, Male/Female
race: racial group, Hispanic/Black/Non-black
educ_year: education years, continus variable of highest grade completed
region: rigion of residence
num_job: Number of jobs ever reported at interview date
urban: urban or rural residence
industry: recoded based on “four sectors theory”: primary sector/secondary sector/tertiary sector/quaternary sector/puublic sector.

Step 2: How do we deal with missing values?

What about factor varibles?

We will assign other factor variables(not are they at the present) whose value is negative to NA, we don’t simply omit them so far, beacuse we may want to think carefully about whether these values are informative.

income gender race educ_year region num_job urban industry
Min. : 45 Min. :1.000 Min. :1.000 Min. : 3.00 Min. :1.000 Min. : 1.00 Min. :0.000 Min. : 170
1st Qu.: 23000 1st Qu.:1.000 1st Qu.:2.000 1st Qu.:12.00 1st Qu.:2.000 1st Qu.: 7.00 1st Qu.:1.000 1st Qu.:4670
Median : 40000 Median :1.000 Median :3.000 Median :13.00 Median :3.000 Median :11.00 Median :1.000 Median :7380
Mean : 54774 Mean :1.496 Mean :2.347 Mean :13.67 Mean :2.652 Mean :12.37 Mean :0.822 Mean :6273
3rd Qu.: 65000 3rd Qu.:2.000 3rd Qu.:3.000 3rd Qu.:16.00 3rd Qu.:3.000 3rd Qu.:16.00 3rd Qu.:1.000 3rd Qu.:8190
Max. :343830 Max. :2.000 Max. :3.000 Max. :20.00 Max. :4.000 Max. :58.00 Max. :2.000 Max. :9990
NA NA NA NA NA’s :33 NA NA’s :33 NA’s :105

As we can see, there are NA values in region, urban, and industry.

To move a step further, we want to know the distribution of these NA values in the dataframe, we will use function aggr in the package VIM here.

Plots above is a visualization of NA values in the data frame, there is an interesting pattern: missing value of geographical variables (region and urban) and those of industryare completely non-overlapping. Since we may want to detect geographysical features of gender wage gap in this report, by looking up the nlsy description documentant, we can probably guess that the reason of missing geographical variables is “Non-interview” or “valid skip”, and also because numbers are realtively small, we can infer that those values are not informative, so we just omit them.

But for 105 missing industry variables, we cannot simmply omit them for 2 reasons, 1) they might be highly informative, as unknown/unreported industry may indicate a strong relationship with income. 2) the number of missing values are relatively large.
We will deal with these NA values below.

Step 3: Now let’s do some recoding to give variables more meaingful names of levels

Some simple releveling first
## 'data.frame':    5146 obs. of  8 variables:
##  $ income   : num  19000 35000 105000 40000 75000 102000 70000 60000 150000 115000 ...
##  $ gender   : num  2 2 1 2 1 2 2 1 1 2 ...
##  $ race     : num  3 3 3 3 3 3 3 3 3 3 ...
##  $ educ_year: num  12 12 16 14 14 19 13 13 13 17 ...
##  $ region   : num  1 4 2 1 1 1 1 1 4 1 ...
##  $ num_job  : num  3 17 12 13 5 24 12 14 4 12 ...
##  $ urban    : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ industry : num  8680 8990 7870 6870 8770 6990 7270 9570 4170 7290 ...
Education

In classical human capital theory, education is one of the most predictor of future income, so we want to include it in our analysis. Here the education is continuous variables indicating eductional years, as we all know the effect of education may not follow a continuous pattern, this is some undermining points of regression model below. We will mutate education as a factor variable afterwards. But in the final regressional model, we will still use numerical education year varible, assumpting human capital accumulation is continuous along with education. Any way, let’s look at educ_year:

## 
##    3    4    5    6    7    8    9   10   11   12   13   14   15   16   17 
##    6    4    2   12   13   39   84   69  100 2120  474  549  291  712  181 
##   18   19   20 
##  254  104  132

As we can see, our sub data frame doesn’t have special values such as “93” indicating PRE-KINDERGARTEN, etc. As assumpted above, we don’t need to do anything to educ_year at the present.

(c) Descriptive Statistics

Tablur Summary

income gender race educ_year region num_job urban industry
Min. : 45 Male :2593 Other :2746 Min. : 3.00 Northeast : 794 Min. : 1.00 Urban:4117 Tertiary :2448
1st Qu.: 23500 Female:2553 Black :1452 1st Qu.:12.00 Northcentral:1247 1st Qu.: 7.00 Rural:1029 Primary : 87
Median : 40000 NA Hispanic: 948 Median :13.00 South :2097 Median :11.00 NA Secondary :1009
Mean : 54932 NA NA Mean :13.68 West :1008 Mean :12.37 NA Quaternary:1106
3rd Qu.: 65000 NA NA 3rd Qu.:16.00 NA 3rd Qu.:16.00 NA Public : 394
Max. :343830 NA NA Max. :20.00 NA Max. :58.00 NA Unknown : 102

From the table above, We can get an overview of the variables in our data. We can see that gender is almost equally distributed in our sample dataset, which is a good news. Proportion of Race is similar as general population in US, Urban/Rural proportion is about 4 : 1. For other variables, let’s get a closer look by plotting.

Overviewing Data by ploting

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

As we can see, income distribution is right-skewed, and there are some outliers in far most side, which is top coded value. We will deal with them later.

Above is the distribution of subjects residence region, south is the most common. (But Note we chose Northeast as baseline group as to be more standard for the analysis. )

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Job number distribution is as expected, right-skewed.

## Warning: Ignoring unknown parameters: binwidth, bins, pad

As we can see, most people works in Tertiary Industry.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

As we can see, most are high school graduate. We will use this pattern to category education as a factor varible to conduct a further analysis in next part.

Methodlogy

(a) Exploring Gender Wage Gap By Visualization

Since we are mainly interested in the gender wage gap across various factors, first we will look into some plot revealing this question.

Across-All Gender Wage Gap

As we can see from the box plot, there is a obvious difference between male and female.

As we can see in the table, male are more likely to be in a higher income level(>50k), this may indicating some potential factors contributing to the gender wage gap. We may try to reveal it later.

As we can say, in both income level, gender wage gap is apparent.

As we can see, gender wage gap seems obvious again, and appears even across regions.

As we expected, gender wag gap is signficiant in both area.

So far, all selected variables seems to have some substantial relationship with gender wage gap. Limited by time, we cannot go through all of them in this report. Now let’s see industry variable.

It’s interesting, since the gap seems varies a little across different industry, in the category “Unknown”, there seems no difference. Industry can be a interesting variable and I want to explore it further in the next part.

Findings

In this part, we will do some regressions and explore a step further to look at favtors that influencing gender wage gap.

To begin with, we need to consider the potential problem of colinearity, by doing some correlation analysis here.

(a) Deal With Colinearity

We are using a function from package “PerformanceAnalytics”, which can directly draw a correlation plot.

Here we are only review correlationship between numeric vairables.

As we can see, we can not say that education years and number of jobs are not correlated, they have a strong relationship. But actually in my regression, I find both variables signficant. But base on current knowedge, I could not know the reason of this as well as how to deal with it, so I included them in the model anyway.

(b) Deal With Top-coded Values

For the problmes of top-coded values, recall that there are a little peak in the far right side in income distribution plot. These values can have a big impact on the model. So we want to deal with this.

We will begin with two scatter plots.

## [1] 343830

As we can see, fitted line become more balanced after excluding top-coded values.

Now let’s do the comparision by look into the out put of regression.

Estimate Std. Error t value Pr(>|t|)
(Intercept) -27425.285 4897.108 -5.600 0.0000
educ_year 8780.012 304.089 28.873 0.0000
genderFemale -31168.949 1490.484 -20.912 0.0000
raceBlack -16829.715 1771.255 -9.502 0.0000
raceHispanic -4857.612 2096.455 -2.317 0.0205
num_job -1088.122 105.143 -10.349 0.0000
regionNorthcentral -7631.277 2339.049 -3.263 0.0011
regionSouth -4029.257 2167.072 -1.859 0.0630
regionWest -3345.379 2500.935 -1.338 0.1811
urbanRural -7741.045 1865.723 -4.149 0.0000
industryPrimary 2836.328 5637.491 0.503 0.6149
industrySecondary 3880.315 1990.390 1.950 0.0513
industryQuaternary 6466.277 1932.436 3.346 0.0008
industryPublic 3847.111 2794.268 1.377 0.1686
industryUnknown 2639.873 5173.980 0.510 0.6099
Estimate Std. Error t value Pr(>|t|)
(Intercept) -5799.263 2763.288 -2.099 0.0359
educ_year 5381.374 174.726 30.799 0.0000
genderFemale -16724.922 842.440 -19.853 0.0000
raceBlack -9710.468 991.637 -9.792 0.0000
raceHispanic -2830.928 1176.850 -2.406 0.0162
num_job -742.275 58.757 -12.633 0.0000
regionNorthcentral -5253.426 1319.805 -3.980 0.0001
regionSouth -3000.390 1223.063 -2.453 0.0142
regionWest -902.031 1410.856 -0.639 0.5226
urbanRural -4040.377 1044.889 -3.867 0.0001
industryPrimary 8131.177 3143.531 2.587 0.0097
industrySecondary 7842.333 1118.715 7.010 0.0000
industryQuaternary 5159.093 1096.290 4.706 0.0000
industryPublic 13045.643 1555.830 8.385 0.0000
industryUnknown 6742.220 2897.005 2.327 0.0200

As we can see in the two tables above, nearly all of cofficients become signficant after droping top-coded values.

(c) Regression Model Construction

As we notice above, region west is not significant, so we may ask the question: is variable region effective/signficaiant when we add it into the model?

We will do an anova test to see if including region is significant.

Estimate Std. Error t value Pr(>|t|)
(Intercept) -8632.056 2628.379 -3.284 0.0010
educ_year 5391.895 174.998 30.811 0.0000
genderFemale -16854.317 843.162 -19.989 0.0000
raceBlack -9779.126 953.268 -10.259 0.0000
raceHispanic -1840.222 1118.128 -1.646 0.0999
num_job -733.932 58.542 -12.537 0.0000
urbanRural -4440.054 1035.694 -4.287 0.0000
industryPrimary 7712.301 3146.530 2.451 0.0143
industrySecondary 7387.640 1115.468 6.623 0.0000
industryQuaternary 5040.822 1097.305 4.594 0.0000
industryPublic 13163.886 1556.440 8.458 0.0000
industryUnknown 6923.084 2899.706 2.388 0.0170

Output seems not bad except that coefficent of race hispanic become non-significant.

Now let’s do the anova test:

## Analysis of Variance Table
## 
## Model 1: income ~ educ_year + gender + race + num_job + urban + industry
## Model 2: income ~ educ_year + gender + race + num_job + region + urban + 
##     industry
##   Res.Df           RSS Df   Sum of Sq      F    Pr(>F)    
## 1   4992 3993144369162                                    
## 2   4989 3977556402529  3 15587966634 6.5173 0.0002139 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

As we can see from the test result, it is highly significant. So we should include region in our regression.

Now let’s do some interpretation of the regression model:

Seen from regression diagnostic plots, we can see that the residuals did not have a completely constant varience, which indicating potential non-linear relationship between income and other variables.

From the normal qq plot, we can see clearly that on the higher side of the income, the residuals did not follow a normally distributed pattern, but for the rest part, it is pretty good. 

We can draw similar conclusion about non-linear relationship from scale-location plot.

For the residuals vs leverage plot, because we have excluded top-coded values, the result seems pretty good.

Overall, model fitting is significant(F-statistics is 138.5671658 and p_value is nearly 0, even the r squared(0.2799765) is not high, which indicates that a large part of variance remain unexplained, based on a knowledge of standard wage model, the major missing variables could be: working years, job training reperience, etc. 
Below are the intepretation of some cofficients:

Intercept: in this model, it indicates that -5799.26$ is the basic earning of such a featured person: male, non-black or hispanic, live in northeast of the country, with no education, live in urban area, having no job experience, expecting working in tieratiry industry. It is very hard to find such a person in reality, but due to the fact the income level here is negative, the interperation itself is somehow meaningless.

Race: As we can see, all else held equal, comparing to the baseline group, race black and race hispanic earns signficantly less the other group, respectively: black (-9710.47$), hispanic (-2830.93$).

num_job: all else held equal, one more job numbers is associated with a deduction of -742.27$ on average, frequently changing job may be indicators of some non-reliable characters of one person, which may result in lower income. 

region: all else held equal, we can say that comparing to the Northeast group, people in other South and Northcentral part of the country earn less, for the west part, the coef is not significant, this might because West part is as developed as Northeast in recent years, but the variance of income could be large, because there are still many rural areas in West.

urban/rural: the coef is as expected, people in rural area earns -4040.38$ less on average. 

industry: the baseline group here is tertiary sector, people in other industry all earn more, which is not interesting. This may be related to the denefination of industry in our dataset here. An impressive pattern is that, people in public sector earn 13045.64$ more than baseline group. 

(d) Exploring Gender Wage Gap Across Various Industries

Step 1: Is There A Significant Gender Wage Gap In Different Industries?

gender industry mean.income lower upper
Male Tertiary 45455.43 43510.71 47400.15
Male Primary 54169.11 44844.43 63493.80
Male Secondary 54678.83 52277.99 57079.66
Male Quaternary 70768.63 66692.63 74844.62
Male Public 70814.03 65521.89 76106.17
Male Unknown 50681.80 40588.48 60775.12
Female Tertiary 34780.21 33297.74 36262.68
Female Primary 31955.25 12695.58 51214.92
Female Secondary 40146.41 36593.14 43699.68
Female Quaternary 45038.05 42788.03 47288.08
Female Public 49635.20 45843.26 53427.14
Female Unknown 48986.87 38671.00 59302.73

Using the table above, we can draw some graphs.

As we can see in the graph, there are wage gap in nearly all industry groups except “Unknown”, but for some group, the confidence interval of mean income is wide, indicating that the gender wage gap may vary dramatically.

To be more clear, we will look directly into the gender wage gap of industry groups and draw the error bar to see if they are significant.

industry diff.sig
Tertiary 0.0000
Primary 0.0391
Secondary 0.0000
Quaternary 0.0000
Public 0.0000
Unknown 0.8139

As we can see, in group primary and unknown, gender wage gap is not significant, so far.

continue with plotting:

industry income.diff lower upper
Tertiary 10675.220 8231.055 13119.38
Primary 22213.864 1204.646 43223.08
Secondary 14532.416 10250.566 18814.27
Quaternary 25730.574 21079.106 30382.04
Public 21178.838 14686.733 27670.94
Unknown 1694.933 -12555.061 15944.93

The plot is saying the some conclusion but more vivid.

Step 2: A Closer Look: Interaction analysis

Now we are interested in: how does industry interacts with gender to affect income?

An advantage of analyzing interaction terms is we can decompose the effect that industry has on gender wage gap by control all other substantial variabls that influcencing income.

Estimate Std. Error t value Pr(>|t|)
(Intercept) -6525.163 2784.709 -2.343 0.0192
educ_year 5319.822 175.122 30.378 0.0000
genderFemale -14325.203 1170.200 -12.242 0.0000
raceBlack -9635.984 990.867 -9.725 0.0000
raceHispanic -2867.767 1175.324 -2.440 0.0147
num_job -731.217 58.756 -12.445 0.0000
regionNorthcentral -5262.700 1318.276 -3.992 0.0001
regionSouth -3007.389 1222.301 -2.460 0.0139
regionWest -784.401 1410.027 -0.556 0.5780
urbanRural -3974.361 1044.728 -3.804 0.0001
industryPrimary 10784.818 3508.788 3.074 0.0021
industrySecondary 8868.478 1355.592 6.542 0.0000
industryQuaternary 10777.671 1785.127 6.037 0.0000
industryPublic 15705.518 2308.690 6.803 0.0000
industryUnknown 5144.453 3911.232 1.315 0.1885
genderFemale:industryPrimary -9675.723 7906.926 -1.224 0.2211
genderFemale:industrySecondary -1037.806 2431.606 -0.427 0.6695
genderFemale:industryQuaternary -8659.843 2184.793 -3.964 0.0001
genderFemale:industryPublic -4654.216 3093.511 -1.505 0.1325
genderFemale:industryUnknown 4209.085 5795.431 0.726 0.4677

It’s interesting, only one coefficient of interaction terms is signficant.

Is it ineffective to include the interection term?

Let’s conduct anova analysis.

## Analysis of Variance Table
## 
## Model 1: income ~ educ_year + gender + race + num_job + region + urban + 
##     industry
## Model 2: income ~ educ_year + gender + race + num_job + region + urban + 
##     industry + gender:industry
##   Res.Df           RSS Df   Sum of Sq      F   Pr(>F)   
## 1   4989 3977556402529                                  
## 2   4984 3962515384400  5 15041018129 3.7837 0.002016 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

See! It’s significant, so what does this mean?

Well, considering plotting analysis above, we can infer that though the gender wage gap is significant in many industry groups, after controling for some variables in the regression, we find that in this quaternary sector group, the gender wage gap still remain signficant. We can infer to things from this, difference between industry groups may be accounted for by other varibles, which our model successfully captured, so most of the gap become non-signficant, but there is still, for some reason, the gender wage gap is significant in quaternary group. And the reason is unclear, we have to do further research to find out why.

The intepration of the coffecient is that all others held equal, men in quaternary earns 8660$ more than women on average.

The difinenation of “Quaternary sector” is: a knowledge-based part of the economy, which typically includes knowledge-oriented economic sectors such as information technology, media, research and development; information-based services such as information-generation and information-sharing; and knowledge-based services such as consultation, education, financial planning, blogging, and designing.

Here comes the most interesting thing! Above are almost exactly the business or jobs we would go to after graduation.

Discussion

Overall, we can draw the conclusion that gender wage gap is almost universal (when comparing man and women in same condition), at least for the factors we included in the model.

Another fact is there are many factors that can have interact or overlapping effect on gender wage gap, some of the mechanism we do know the reason, for some other, like the one we explored in the report “industry”, the reasons of the significance of the interaction term needs further more research on.

I have like 70% confidence on my analysis, one most important deduction is that we didn’t include possibily most powerful factor, which is job-realted experience or capacity level, that influences income. And the regression model itself can also be potentially improved by adding or replacing some varibles or interaction terms. What’s more, as we mentioned before, non-linear pattern of the effect of education can be a potential weak point of the analysis.